A Perspective on the State of Research
in Fault-Tolerant Systems


Abstract:  As  computers  take  on  a  greater  role  in  society,  their  dependability 
is  becoming  increasingly  important.  Given  software’s  critical  role  in  computing 
systems,  reliable  software  has  emerged  as  crucial  to  achieving  a  dependable 
infrastructure.  Using  a  system  perspective  that  recognizes  the  prominence  of 
software,  we  characterize  the  current  state  of  fault-tolerance  research  as  it 
contributes  to  the  dependability  of  computer  systems  and  we  conjecture  on 
future  directions  for  this  research  area. 


1  Introduction 

We  are  becoming  increasingly  dependent  on  software-based  systems,  often  without  realizing 
it. While by no means can all such systems be classified as critical, software is turning up
everywhere, from airplanes and automobiles to television sets and electric razors. Also, the
percentage of software in these systems (relative to hardware) is increasing. According to Randell
[Randell 95], the amount of software in consumer products is doubling every year (e.g., a
top-of-the-line television now contains 500 kilobytes of software).

Consequences of failures in these systems can range from inconvenient (e.g., a television
remote control that will not work) to catastrophic (e.g., software in a commercial aircraft that
inadvertently prevents the pilot from recovering from an error). The dependability of a television
set may not be critical in the normal sense; however, for the manufacturer a television’s low
dependability can translate into lower sales and higher warranty repair costs. For the
consumer, an undependable television may result in the inability to obtain important information
(e.g., school closings in a snowstorm).

Fault  tolerance  has  long  been  used  as  a  means  of  improving  the  dependability  of  computer 
systems. This report presents a perspective on research in fault tolerance as it relates to
dependability in software-based systems and attempts to describe the current state of, and
outline future directions for, this broad research field. While references are made to the
commercial suppliers of fault-tolerant systems, this is not intended to be a survey of the
capabilities of these systems.

The next section presents the system perspective that is used throughout this report and
discusses the problems involved in designing a system for dependability. Section 3 discusses the
evolution  of  fault-tolerance  research  including  the  current  state  and  future  prospects.  Finally, 
Section  4  presents  a  summary  of  the  report. 


2 Dimensions of the Problem Space


2.1  System  Perspective 

In  the  software  engineering  arena,  a  system  is  often  equated  with  software,  or  perhaps  with 
the  combination  of  computer  hardware  and  software.  Here,  we  use  the  term  system  in  its 
broader sense. As shown in Figure 2-1, a system is the entire set of components, both
computer related and non-computer related, that provides a service to a user. For instance, an
automobile is a system composed of many hundreds of components, some of which are likely to
be computer subsystems running software.


Figure  2-1  System  Relationships 


A  system  exists  in  an  environment  (e.g.,  a  space  probe  in  deep  space),  and  has  operators  and 
users  (possibly  the  same).  The  system  provides  feedback  to  the  operator  and  services  to  the 
user.  Operators  are  shown  inside  the  system  because  operator  procedures  are  usually  a  part 
of the system design, and many system functions, including fault recovery, may involve
operator action. Not shown in the figure, but of equal importance, are the system’s designers and
maintainers.

Systems  are  developed  to  satisfy  a  set  of  requirements  that  meet  a  need.  A  requirement  that 
is  important  in  some  systems  is  that  they  be  dependable.  A  dependable  system  is  one  for 
which  “reliance  can  justifiably  be  placed  on  the  service  it  delivers  [Laprie  92].”  Fault  tolerance 
is  a  means  of  achieving  dependability. 

There are three levels at which fault tolerance can be applied: hardware, software, and
system (user interface). All three levels are susceptible to design, implementation, or
maintenance errors: human mistakes that exist as faults in the hardware, code, or user interface and
that are manifested in the behavior of the system. Hardware is unique among the three in that
it is susceptible to “wear out” and damage. Traditional fault tolerance compensates for faults
in computing resources (hardware). By managing extra hardware resources, the computer
subsystem increases its ability to continue operation. Measures to ensure hardware fault
tolerance include redundant communications, replicated processors, additional memory, and
redundant power/energy supplies. Management of this redundancy often involves the use of
software. Hardware fault tolerance was particularly important in the early days of computing,
when the time between machine failures was measured in minutes.

A  second  level  of  fault  tolerance  recognizes  that  a  fault-tolerant  hardware  platform  does  not, 
in  itself,  guarantee  high  availability  to  the  system  user.  Faults  can  also  arise  from  software 
components. These faults occur because the software was designed, implemented, or
maintained incorrectly. Software fault tolerance (including mechanisms such as checkpoint/restart,
recovery  blocks,  and  multiple-version  programs)  is  used  to  compensate  for  faults  at  this  level. 
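
To make one of these mechanisms concrete, the following is a minimal sketch of a recovery block in Python. It is an illustration under stated assumptions rather than a reference implementation: the alternates receive their input by value, so no explicit state restoration is needed between attempts, and the names recovery_block, faulty_sqrt, newton_sqrt, and close_enough are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class AllAlternatesFailed(Exception):
    """Raised when no alternate produces a result that passes the acceptance test."""


def recovery_block(
    alternates: Sequence[Callable[[T], R]],
    acceptance_test: Callable[[T, R], bool],
    state: T,
) -> R:
    """Try each alternate in order and return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate(state)
        except Exception:
            # A crash in an alternate is treated like a failed acceptance test.
            continue
        if acceptance_test(state, result):
            return result
    raise AllAlternatesFailed("no alternate passed the acceptance test")


def faulty_sqrt(x: float) -> float:
    # Hypothetical primary with a seeded design fault: wrong for x > 100.
    return x / 2 if x > 100 else x ** 0.5


def newton_sqrt(x: float) -> float:
    # Independently written alternate based on Newton's iteration.
    guess = x if x > 1 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def close_enough(x: float, root: float) -> bool:
    # Acceptance test: checks the result without recomputing it the same way.
    return abs(root * root - x) <= 1e-6 * max(1.0, x)
```

Calling recovery_block([faulty_sqrt, newton_sqrt], close_enough, 144.0) rejects the primary's answer and falls through to the alternate. The quality of the acceptance test bounds what the scheme can detect, which is why it is kept independent of both alternates here.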

A  third  level  of  fault  tolerance  realizes  that  software  and  hardware  do  not  exist  independently 
and that they execute in some environment. For instance, failures can arise as the result of
actions by a user, whether a person or another computer system, through the operational
interface between the system and user. Measures taken at this level are usually
application-specific and may be implemented in hardware or in software. At this level we have system fault
tolerance.

The  system  view  is  recursive  in  nature.  A  subsystem  may  in  itself  consist  of  both  hardware  and 
software  components.  A  simple  example  of  this  is  a  microprocessor  whose  instruction  set  is 
implemented  in  microcode.  The  microprocessor  is  subject  to  all  of  the  usual  kinds  of  hardware 
faults  (e.g.,  single  bit  upsets),  as  well  as  all  of  the  usual  kinds  of  software  faults  (e.g.,  design 
errors  in  the  microcode).  The  user  of  the  microprocessor  does  not  know  or  care  which  aspects 
of  the  subsystem  are  implemented  in  hardware  or  software.  The  microprocessor  looks  like  a 
piece  of  hardware  to  most  users.  To  the  designer  of  the  microcode,  the  microcode  looks  like 
software running on a specialized piece of hardware (the engine that executes the microcode).
The  designer  will  have  to  deal  with  the  hardware  and  software,  which  together  make  up  the 
microprocessor,  and  be  prepared  to  deal  with  the  potential  misuse  of  the  microprocessor  by 
the  user. 

2.2 Means to Achieve Dependability

The International Federation for Information Processing (IFIP) Working Group 10.4 has
identified [Laprie 92] means to achieve improved dependability for a system. These can be
categorized into fault-avoidance or fault-tolerance techniques. Fault avoidance prevents faults
from occurring in the operational system and includes fault prevention, fault removal, and fault
forecasting. Fault tolerance compensates for, and protects against, the impacts of faults
during system operation.

Because the desired property of a system is improved dependability, software fault-tolerance
research is not isolated from software fault-avoidance issues. Understanding how faults occur
and their nature is essential to defining strategies for tolerating the manifestation of faults in
operational systems. Consequently, issues in fault-avoidance research are inseparable from
considerations of fault-tolerance research.

2.3  Design  for  Dependability 

When designing systems where dependability is required, it is important to deal with
dependability issues from the start by including fault-tolerance mechanisms within the system design
and employing fault-avoidance techniques as appropriate in the design process. Adding
dependability after the fact can be both more expensive and less effective than designing it in
from the start.

Depending on the nature of the system, tolerance of conventional faults or failures is not the
only important consideration when designing dependable systems. Attention should also be
paid to potential malicious “threats.” As systems become more integrated, it is increasingly
attractive to attack them, for teenage hackers as well as foreign terrorists. Attacks on
networked computer systems are becoming common, as evidenced by the increasing number of
CERT® advisories appearing on the Internet. Much of our nation’s infrastructure is becoming
dependent on fragile, distributed, software-dependent systems: the power grid, transportation
(particularly air and rail), and our financial systems. These systems are potential targets for
intruder attacks.

2.4  Social  and  Economic  Issues 

The issues surrounding dependable systems are not just technical in nature. There are difficult
social and economic barriers to be surmounted before dependability practices become
standard in software-based systems. Although this report concentrates on technical areas, it is
important to recognize the crucial role of these social and economic issues.

Building dependable systems has a higher initial cost than building a system without paying
special attention to dependability. Because of this, dependability has been a hard sell. Except
in a limited set of critical applications with catastrophic failure modes (e.g., flight control,
railroad signaling), there has been little perception or hard data to establish that the extra costs
are worthwhile. This is changing. Whole industries are becoming dependent on software to
operate, and the need for and cost effectiveness of higher levels of dependability are being
recognized. Still, where the safety or economic benefit of dependability is not obvious, there is
little demand for it.

There  are  cases  where  dependability  (safety)  drives  the  design  (e.g.,  commercial  aircraft).  In 
others,  regulations  or  economic  incentives,  such  as  severe  penalties,  may  be  needed  to  help 
push  developers  into  designing  for  dependability. 

Society  demands  dependability  in  other  arenas — much  more  so  than  in  software.  Often,  the 
demand  for  dependability  is  the  result  of  some  disaster  or  series  of  disasters  that  brings  the 
dependability issue to the forefront. Leveson [Leveson 94] draws parallels between the
proliferation of high-pressure steam engines in the early steam era and computer software today.
Boiler explosions and their aftermath led to the routine use of safety devices and to higher
standards of workmanship and engineering. Similarly, it may require one or more major
software-related disasters to cause the public or management to routinely demand dependability
guarantees on software-based systems.

Making  systems  dependable  is  difficult  and  requires  systematic  engineering,  not  just  extensive 
user testing. The additional cost and/or time involved in achieving higher levels of
dependability result in systems with a higher initial cost. Therefore, if a consumer makes a purchase
decision based only on price, ignoring the long-term costs associated with lower dependability,
there  is  little  incentive  for  the  supplier  to  spend  the  additional  resources  on  making  the  system 
dependable. 

One reason that dependable systems cost more is that most fault-tolerant system
architectures are proprietary. There is hope that increased demand will result in generic
implementations, which will bring the cost down. There is evidence that this is already happening. New
fault-tolerant systems by Tandem [Tandem 96], Sequoia [Sequoia 96], and other
manufacturers are based on UNIX and are much more open than their predecessors.


3 Evolution of Fault-Tolerance Research


Research  in  the  field  of  computer  system  fault  tolerance  has  evolved  throughout  the  history  of 
electronic  computing.  The  unreliability  of  vacuum  tubes  and  related  discrete  components  of 
the earliest electronic computers necessitated conservative engineering and constant
attention, if not extensive redundancy. For example, ENIAC included over 17,000 vacuum tubes
and related components totalling more than 100,000 discrete parts. The key designers of the
ENIAC realized that the success of the entire project rested on achieving component and
system reliability [Goldstine 72].

As  the  field  of  computer  design  evolved,  inherent  component  reliability  and  overall  hardware 
reliability  increased  [Gray  91].  These  improvements  were  due  to  the  evolution  of  hardware 
technology (e.g., vacuum tubes to transistors to integrated circuits and higher levels of
component integration), as well as the advancement of fault-avoidance and fault-tolerance
techniques. Fault avoidance helped improve the quality of both the components and the systems,
while fault tolerance found its greatest application in high-availability and
continuous-performance systems, where computers began to take on increasingly critical roles in commercial
and industrial endeavors (e.g., process control and information processing and storage).

An increase in overall complexity and capability accompanied the evolution of computer
technology, with much of this increase implemented within software. Consequently, software
became a larger and indispensable part of systems. With this growth, software began to emerge
as  a  significant  contributor  to  system  unreliability  [Gray  90]. 

While software reliability issues continue to play a critical role in achieving higher levels of
dependability, the ubiquitous role of software has elevated human-computer interaction (HCI)
issues into prominence. This situation is especially evident in applications where
software-intensive systems have replaced traditional hardware-intensive systems as the primary
technologies for information presentation and interaction (e.g., control panels replaced with
graphical display units). Within these domains and in the operation of these systems, human error
has  been  responsible  for  more  downtime  than  generally  recognized  and,  as  a  result,  has  had 
a  significant  effect  on  system  performance  and  dependability  [Maxion  96,  Velpuri  95]. 

It is within the context of computer system evolution that fault-tolerance research has
contributed substantially to improved system dependability. It is this evolution that has resulted in
increasing research into the broader (non-hardware oriented) aspects of system dependability.
Reliable  software,  user  interfaces,  and  inter-application  interfaces  within  software-intensive 
systems  have  emerged  as  important  considerations  in  fault-tolerance  research. 

The broad focus has stimulated research into understanding the effect of software
development practices on system reliability. In addition, researchers are investigating the
characteristics of software architectures that will enable the attributes (specifically dependability) of
software-intensive  systems  to  be  predicted. 

3.1 Issues Facing Fault-Tolerance Researchers

As reflected in specialized conferences, sessions within general conferences, and journal
publications, fault tolerance is a very active research field. Much of today’s work is aimed at
incremental improvement, maturing basic understanding in specific technology focus areas, and
expanding the applicability of fault-tolerance technologies. There are no overriding
technological approaches that dominate or help focus research, and the community is attacking pieces
of the technology, problems, and solutions. This situation is in contrast to a few years ago,
when a central theme (i.e., n-version programming) typified much of the software
fault-tolerance research and attracted a multitude of researchers.

3.1.1  Broad  Range  of  Issues 

This  section  includes  a  representative  set  of  active  research  topics  in  fault  tolerance.  These 
range  from  fundamental  fault-tolerance  design  and  implementation  issues  (e.g.,  checkpoint 
restart,  distributed  algorithms)  to  fault-avoidance  approaches  (e.g.,  formal  methods).  The  wide 
range of topics and the interweaving of fault-avoidance with classical fault-tolerance issues
are  indicative  of  the  broad  and  complementary  nature  of  these  research  areas. 

Checkpoint  Restart 

Checkpoint restart strategies are backward recovery techniques for saving the state of a
system to enable resuming operation from a well-defined state [Pradhan 96, Jalote 94]. This is a
classic fault-tolerance issue that is now being addressed in the context of increasingly complex
and distributed systems. Much of the current research work involves issues relating to reliable
high-performance checkpointing [Plank 95] and distributed systems [Wang 95, Smith 95].
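
The backward-recovery idea can be illustrated with a minimal single-process sketch in Python. It is a toy under stated assumptions, not any of the cited schemes: checkpoints are kept in memory, processing is deterministic, and an item that triggers an error is skipped after the rollback. The names run_with_checkpoints and add_reading are hypothetical.

```python
import copy


def run_with_checkpoints(items, step, initial_state, interval=2):
    """Apply step(state, item) to each item, rolling back to the last
    checkpoint when a step raises, then re-executing from that point."""
    state = copy.deepcopy(initial_state)
    saved_state, saved_index = copy.deepcopy(state), 0
    bad = set()  # indices of items known to trigger an error
    i = 0
    while i < len(items):
        if i in bad:
            i += 1
            continue
        try:
            step(state, items[i])
        except Exception:
            # The failed step may have left the state half-updated, so the
            # whole state is restored from the checkpoint before retrying.
            bad.add(i)
            state, i = copy.deepcopy(saved_state), saved_index
            continue
        i += 1
        if i % interval == 0:
            saved_state, saved_index = copy.deepcopy(state), i
    return state, sorted(bad)


def add_reading(state, reading):
    state["count"] += 1  # partial update happens first ...
    if reading < 0:
        raise ValueError("invalid reading")  # ... then the error is detected
    state["total"] += reading
```

With the readings [3, 4, 5, -1, 6] and an interval of 2, the failure at the fourth item discards the work done since the checkpoint taken after the second item, and that work is then repeated. The interval trades checkpointing overhead against the amount of re-execution after a failure.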

Distributed  Algorithms 

Unlike  uniprocessor-based  systems,  distributed  systems  present  a  new  set  of  challenges  to 
achieving dependability. Clocks of the processors often must be synchronized, data must
survive failures of individual processors, nodes must fail in controlled ways, and the
communications between the processors must be reliable. Techniques such as interactive consistency,
fail-stop processors, and reliable transport mechanisms have been developed to deal with
these challenges. Many of the mechanisms are expensive to implement, and researchers
continue to look for more efficient algorithms [Jalote 94].

Group  Communications 

In distributed systems, high availability can be achieved by replicating state among groups of
servers. As long as the state is consistent between the servers, the system is able to recover
from the failure of one of them. This replica consistency is often achieved by using group
membership protocols, which ensure that all running servers share a common view of the system
configuration, and atomic broadcast, which ensures that each server is updated correctly.
Work in this area is widespread, including at the University of California at San Diego [Cristian
96] and Cornell University [Schneider 90].
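
The role of ordered delivery in keeping replicas consistent can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the protocols of the cited groups: it assumes a single sequencer that never fails, a fixed membership, and no message loss. The class names Sequencer and Replica are hypothetical.

```python
class Sequencer:
    """Assigns a global sequence number to every update (assumed not to fail)."""

    def __init__(self):
        self._next = 0

    def stamp(self, update):
        seq = self._next
        self._next += 1
        return seq, update


class Replica:
    """Applies updates strictly in sequence order, buffering early arrivals."""

    def __init__(self):
        self.value = 0
        self._next_seq = 0
        self._pending = {}

    def receive(self, seq, update):
        self._pending[seq] = update
        # Drain every buffered update that is now next in the total order.
        while self._next_seq in self._pending:
            op, amount = self._pending.pop(self._next_seq)
            if op == "add":
                self.value += amount
            elif op == "mul":
                self.value *= amount
            else:
                raise ValueError(f"unknown operation: {op}")
            self._next_seq += 1
```

Because "add" and "mul" do not commute, two replicas that applied the same updates in arrival order could diverge. Holding back updates until their sequence number is due makes every replica apply the same updates in the same order, whatever the network delivery order.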

Fault Tolerance in Human-Computer Interaction (HCI)

Issues and perspectives in fault-tolerance research are expanding from internal software state
issues to considerations of interfaces between systems and between humans and computers.
This broadened perspective includes considerations of the environment and its effect on
software and overall system dependability.

Questions such as “How will this technology affect usability and safety?” and “How do human
errors contribute to system downtime?” are being addressed [Neumann 95, Gray 91]. These
investigations involve both fault-tolerance and fault-avoidance issues in trying first to predict
how errors will be introduced by human operators, under what conditions they will be
introduced, and what types of errors will be introduced, and then to determine how to tolerate
these occurrences [Maxion 96, Velpuri 95, Ladkin 96].

Fault  Injection 

Fault injection is a technique for evaluating the dependability of a system. Its purpose is to test
the ability of a system to detect and recover from faults. As a result of running fault-injection
experiments, designers are able to determine the fault coverage of their system. Fault injection
involves seeding the system with faults under controlled conditions and observing its behavior.
Many tools have been developed to aid in fault injection, including FIAT at Carnegie Mellon
University [Barton 90], FERRARI at the University of Texas at Austin [Kanawati 95], and
DEPEND at the University of Illinois [Ries 96].
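
A fault-injection experiment and the resulting coverage figure can be sketched on a very small scale. The Python example below is not modeled on any of the tools named above; it assumes a hypothetical system under test that protects an 8-bit word with a single even-parity bit, seeds bit flips at random positions, and reports the fraction of injected faults that the parity check detects.

```python
import random


def encode(data_bits):
    """Append an even-parity bit to the data word."""
    return data_bits + [sum(data_bits) % 2]


def parity_ok(word):
    """Detection mechanism under test: the word must have even parity."""
    return sum(word) % 2 == 0


def inject_bit_flips(word, flips, rng):
    """Seed the word with the given number of bit flips at distinct positions."""
    corrupted = list(word)
    for position in rng.sample(range(len(word)), flips):
        corrupted[position] ^= 1
    return corrupted


def fault_coverage(flips, trials=1000, seed=0):
    """Fraction of injected faults that the parity check detects."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        word = encode([rng.randint(0, 1) for _ in range(8)])
        if not parity_ok(inject_bit_flips(word, flips, rng)):
            detected += 1
    return detected / trials
```

Under this fault model the experiment shows what the mechanism can and cannot do: every single-bit flip changes the parity and is detected, while every double-bit flip restores it and escapes. The measured coverage is therefore only as meaningful as the injected fault model is representative.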

Measurement  and  Interpretation 

A  significant  difficulty  in  evaluating  fault-tolerance  and  fault-avoidance  mechanisms  is  the  lack 
of  real-world  data.  Without  this  data  it  is  impossible  to  determine  the  efficacy  of  a  method  in  a 
deployed  system.  Such  data  exist  but  are  often  proprietary.  Some  researchers  have  obtained 
access  to  such  data  and  have  been  able  to  tune  techniques  and  improve  systems  as  a  result. 
Notable  examples  of  work  in  this  area  have  taken  place  at  the  University  of  Illinois  and  Tandem 
Computers  [Lee  95,  Thakur  95],  and  at  IBM's  T.  J.  Watson  Research  Center  [Chillarege  95]. 

Reliability  Modeling 

Reliability  modeling  is  the  modeling  of  faults  and  errors  with  the  intention  of  predicting  future 
behavior  [Musa  90,  Laprie  96,  Farr  96].  Traditional  hardware  reliability  models  are  empirically 
based  and  reflect  physical  characteristics  of  hardware  failures,  e.g.,  random  failure  models.  In 
contrast, software and system failure models lack the physical data to guide reliability
modeling and require reliance on usage data. Usage data is highly problematic both to collect and,
because  of  the  dependency  of  software  reliability  on  its  use,  to  generalize  across  systems. 
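
As a small worked example of the hardware case, the sketch below uses the constant-failure-rate (exponential) model, which is assumed here as one common instance of a random failure model and is not taken from the cited works. The failure rate is estimated from observed operating time, and the reliability over a mission of t hours is R(t) = exp(-lambda * t). The function names are hypothetical.

```python
import math


def estimate_failure_rate(failures, operating_hours):
    """Point estimate of a constant failure rate, in failures per hour."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


def mean_time_between_failures(failure_rate):
    """Under the exponential model, MTBF is the reciprocal of the rate."""
    return 1.0 / failure_rate


def reliability(failure_rate, mission_hours):
    """Probability of surviving mission_hours without failure: exp(-rate * t)."""
    return math.exp(-failure_rate * mission_hours)
```

For instance, 4 failures observed in 8,000 operating hours give an estimated rate of 0.0005 failures per hour, an MTBF of 2,000 hours, and a 1,000-hour reliability of exp(-0.5), roughly 0.61. The same arithmetic applied to software is only as good as the assumption that future usage resembles the usage under which the failures were observed.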

Recent  research  in  reliability  models  for  predicting  the  future  behavior  of  a  system  has  focused 
on extending models to include complex and distributed systems and on addressing software
reliability and its impact on overall system reliability. Summaries of these are available in a
variety of publications [Pham 92, Pham 95, Lyu 96].

Formal Methods

The use of formal methods is a fault-avoidance technique that helps to achieve dependable
systems by locating inconsistencies in their design or implementation. The application of
formal methods to dependable systems may involve detailed hand analysis, semi-automatic
analysis (e.g., tools to help tabulate data), or the use of complex theorem provers. No matter which
techniques are used, formal methods can often find extremely subtle problems, but
scaling to large systems can be problematic. SRI International [Rushby 95], the National
Aeronautics and Space Administration (NASA) [Suiter 95], and the Naval Research Laboratory
(NRL) [Heitmeyer 96] are among the many organizations working in this area.

Object-Oriented  Systems 

Advances in the implementation of object-oriented systems and their expansion into critical
and real-time applications have stimulated work attempting to integrate fault-tolerance and
object-oriented technologies [Xu 95, Landis 95]. Specific efforts include addressing the critical
fault-tolerance issues of partial failures and consistent ordering of events in distributed
systems that are not adequately handled, for example, in Object Request Broker (ORB) or
Common Object Request Broker Architecture (CORBA) technologies [Maffeis 95].

Innovative  Applications 

This  research  involves  applying  fault  tolerance  in  new  ways  to  solve  problems  that  have  not 
traditionally been associated with fault tolerance. It is based on abstracting the results of
fault-tolerance research and mapping these to other problem spaces. Examples include the use of
fault tolerance to enable dependable system upgrade [Sha 96] and the application of fault
tolerance to information survivability and security problems [Randell 95, Wilken 95].

3.1.2  Technology  Transition 

An  emerging  perspective  that  is  gaining  greater  emphasis  is  the  challenge  associated  with 
transferring  existing  fault-tolerance  technology  into  widespread  practice. 

Recently,  the  High  Assurance  Computing  Workshop  team  (which  was  composed  of  50  experts 
from academe, government, and industry) identified two primary problems associated with
high-assurance (dependable) computing. Both problems involved the application of
dependable computing to the development of practical systems. These problems were (1) lack of
technology  to  support  the  application  of  dependability  techniques  and  (2)  lack  of  a  unified 
framework  for  building  systems  that  satisfy  several  critical  properties  (e.g.,  dependability  and 
performance)  simultaneously  [McLean  95]. 

Evidence of efforts to transfer fault-tolerance techniques is seen in recent publications. For
example, in the field of software fault tolerance, two volumes edited by Michael Lyu [Lyu 95, Lyu
96] attempt to capture the state of software fault tolerance and software reliability. Similarly,
[Jalote 94, Pradhan 96, Siewiorek 92] capture broader design practices for fault-tolerant
systems. These efforts are representative of the process of codifying the field toward integrating
fault-tolerance techniques into more general engineering practice.

As with any transition challenge, the transition of fault-tolerance technologies must address
not only technical but also economic issues. This must include engineering studies and validated
data  on  the  impact  (e.g.,  performance,  cost)  and  the  benefits  (e.g.,  reduced  downtime,  lower 
maintenance  costs)  involved  in  engineering  more  dependable  software-intensive  systems. 

3.1.3  Difficult  Challenges 

While  one  might  conjecture  that  the  field  is  mature,  based  on  the  emergence  of  the  importance 
of  transition,  this  is  not  a  completely  accurate  description.  Rather,  while  substantial  success 
has been realized in the application of established techniques, the system and software
fault-tolerance research communities still face complex problems that are too challenging for
current technologies. As an example, published work has identified the difficulties in black-box
testing of software systems for high-reliability applications. In some cases, this work has
described the infeasibility of proving ultra-high levels of software reliability, even with the use of
advanced testing methods [Littlewood 91, Butler 93].
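
The scale of the testing problem can be shown with a back-of-the-envelope calculation. The Python sketch below is a standard textbook-style argument under an assumed constant failure rate, and is not necessarily the analysis used in the cited works: if a system runs failure-free for t hours, claiming a failure rate no worse than lambda with confidence C requires exp(-lambda * t) <= 1 - C, that is, t >= -ln(1 - C) / lambda. The target of one failure per billion hours is an illustrative assumption.

```python
import math

HOURS_PER_YEAR = 24 * 365.25


def failure_free_test_hours(target_rate, confidence):
    """Failure-free test time needed to claim the target failure rate
    (failures per hour) at the given confidence, assuming a constant rate."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if target_rate <= 0.0:
        raise ValueError("target_rate must be positive")
    return -math.log(1.0 - confidence) / target_rate


def failure_free_test_years(target_rate, confidence):
    """Same requirement expressed in years of continuous testing."""
    return failure_free_test_hours(target_rate, confidence) / HOURS_PER_YEAR
```

For a target of 1e-9 failures per hour at 99 percent confidence, the required failure-free test time is about 4.6 billion hours, which is more than half a million years on a single test rig. Running many rigs in parallel shortens the calendar time but not the total exposure that must be accumulated.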

Similarly, challenges such as proving levels of independence among various software
versions and establishing reliable (provable) error-detection mechanisms remain unresolved.
For example, to determine if an error has occurred, a reference (a “correct” value) must be
used. In hardware, the output of redundant components can be used as a reference. In this
case, the challenge is the reliable implementation of the mechanism to compare outputs with
the reference(s). In software, the challenge is not only to implement the comparison but to
establish a credible reference across the system’s full operating environment, a reference that is
itself sufficiently reliable.
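
The comparison mechanism and its dependence on independent failures can be illustrated with a majority voter. The Python sketch below is a simplified illustration under assumptions: outputs are compared for exact equality, and the names majority_vote and NoMajority are hypothetical.

```python
from collections import Counter


class NoMajority(Exception):
    """Raised when no output value is shared by more than half of the versions."""


def majority_vote(outputs):
    """Return the value produced by a strict majority of redundant versions."""
    if not outputs:
        raise NoMajority("no outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count > len(outputs):
        return value
    raise NoMajority(f"no strict majority among {list(outputs)!r}")
```

With outputs (42, 42, 41) the voter masks the single deviating version. With outputs (41, 41, 42), where two versions share the same fault, it confidently returns the wrong value: the vote is only a credible reference if the versions fail independently, which is the property that remains hard to establish for software.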

Current  fault-tolerance  research  areas  are  varied  and  there  is  no  obvious  central  direction  for 
future research activities. There is a sense in the community, as evidenced by informal
discussions at conferences and workshops, of a need to develop more of a focus, or at least more
clearly define the state of fault-tolerance research. The situation can be characterized as one
that is opportune for a major breakthrough, perhaps a paradigm shift that would enable
fundamental improvement in predictably achieving software and system dependability and safety.

3.2  Prospects  for  Fault-Tolerance  Technologies 

Building  upon  the  assessment  of  the  current  state  of  fault-tolerance  research,  this  section 
summarizes potential directions for technologies and research in fault tolerance. These are
divided into those areas that are advancing the technology itself and those that are making
fault-tolerance technology more widely available.

3.2.1  Advancing  the  Technology 

The core of the activity in fault-tolerance research will center on advancing fundamental
approaches within the discipline. These directions will likely involve four broad aspects: new
breakthroughs, software, user issues, and proof of correctness.

New Breakthroughs

Redundancy  and  diversity  of  implementation  have  been  cornerstones  of  system  and  software 
fault tolerance. But given the current situation, significant advancements in system fault
tolerance may be realized only through a fundamentally new paradigm (e.g., a revolutionary
fault-tolerance approach, the identification of methods that prove the degree of independence of
various software versions, or dramatic advances in the application of formal methods or proof
of correctness). Some of these approaches may involve post-implementation testing or
modifications to software development practices. This work may very well profit from general
research in understanding software engineering practices, software structure, and software
attribute  prediction. 
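
To make the role of redundancy and diversity concrete, the following minimal sketch shows majority voting over independently written versions of one function. It is an illustration only, not a method taken from this report: the names `majority_vote`, `isqrt_builtin`, `isqrt_newton`, and `isqrt_faulty` are hypothetical, and the fault in the third version is seeded deliberately.

```python
import math
from collections import Counter


def majority_vote(versions, x):
    """Run every version on x and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A version that crashes contributes no vote.
            continue
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    # Require agreement from more than half of ALL versions, not just survivors.
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


def isqrt_builtin(n):
    """Version 1: integer square root from the standard library."""
    return math.isqrt(n)


def isqrt_newton(n):
    """Version 2: integer square root by Newton iteration."""
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_faulty(n):
    """Version 3: contains a deliberately seeded fault for the input 16."""
    return math.isqrt(n) + (1 if n == 16 else 0)


# One faulty version is outvoted by two correct ones.
masked = majority_vote([isqrt_builtin, isqrt_newton, isqrt_faulty], 16)

# Two versions sharing the same fault outvote the correct one.
common_mode = majority_vote([isqrt_faulty, isqrt_faulty, isqrt_builtin], 16)
```

In this sketch `masked` is the correct value 4, while `common_mode` is the wrong value 5. The second case shows why methods that establish the degree of independence among versions matter: the voter cannot distinguish a correct majority from a correlated failure.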

Software  Fault  Tolerance 

Software  faults  are  idiosyncratic  to  a  software  implementation  and  represent  logical  or  design 
errors  made  by  human  beings.  While  the  use  of  software  dictates  how  and  when  faults  are 
manifested,  software  and  software  reliability  embody  the  results  of  human  efforts  to  interpret 
and  understand  the  interactions  of  complex  structure  and  behavior. 

These logical or design errors in software mirror research problems found in software
engineering generally. Much of the current and future research in software engineering seeks to
establish patterns in the structure of software [Clements 95] and to better understand the
practice of engineering software systems. In the future, fault tolerance research will find greater
synergy  with  and  benefit  from  the  research  in  architectures,  patterns,  objects,  and  related 
technologies.  These  efforts  may  form  the  foundation  for  enabling  greater  predictability  in  the 
reliability  and  dependability  attributes  of  software  systems. 

User  Environment 

Continuing  issues  related  to  the  interplay  of  the  user  and  the  system  will  provide  a  substantial 
challenge for researchers. These issues will require more interdisciplinary work, especially
between fault-tolerance and HCI researchers, and will require understanding the
interrelationships between technology and its role within society.

Proof  of  Correctness 

At least for critical parts of a system, such as error detection and recovery software, a
concerted effort to find new directions in “proving” the correctness of software may be necessary.
Some  possible  directions  include  the  use  of  powerful  abstractions  to  reduce  the  complexity 
and  more  “natural”  specification  languages  which  would  be  embraced  by  programmers.  Such 
techniques  are  being  applied  successfully  to  complex  hardware  designs.  (Intel,  IBM,  Motorola, 
AMD,  HP,  DEC,  SUN,  SGI,  etc.,  all  have  formal  verification  groups;  Intel,  for  example,  has  a 
large  number  of  researchers  working  on  the  problem.^) 


^ The ideas expressed in this paragraph reflect a private communication between the authors and Jacob
Abraham of the University of Texas at Austin.

3.2.2 Making the Technology Available

Much of the research in fault tolerance will continue to support making these techniques
available to a broader community of practice and applying them to more domains. The exact
direction of these efforts will likely be driven by the technologies that emerge as important to society
(e.g.,  World  Wide  Web,  mobile  computing  [Pradhan  96]).  Work  in  this  area  will  focus  on  the 
following  three  issues:  repackaging  and  improving  fault-tolerance  technologies,  transitioning 
the  technology,  and  extending  the  technologies  to  other  areas. 

Repackage  and  Improve 

Work will continue on new and optimized algorithms and enhanced implementations of
fault-tolerance technologies. Much of this work will be aimed at structuring building blocks for use
and reuse. For example, some of these advances may include higher quality, more robust
algorithms for complex, highly distributed systems; enhanced approximations toward ensuring
near-absolute fail-stop capabilities; and reuse of fault-tolerant software based upon
architectural patterns.

Transition  Mission 

As fault-tolerance technology matures, increasing attention will likely turn to implementation
and use in real systems. Investigations may include determining the practicality of techniques
and whether these will scale to large real-world systems. These efforts will also need to
consider the social and economic aspects of the technology.

Extend  to  Other  Pressing  Problem  Areas 

Efforts  will  likely  continue  in  the  application  of  fault-tolerance  techniques  to  other  problem 
spaces and in the integration of fault-tolerance techniques with related or supporting
technologies. These directions will also foster a broad systems orientation in developing
software-intensive systems. Current examples include the use of Simplex [Sha 96] for system upgrade,
use of commercial off-the-shelf (COTS) components in dependable systems, integrating
object-oriented approaches with fault tolerance in reflective programming [Xu 95], and the
application of fault tolerance to information survivability problems [Randell 95].
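
As a rough illustration of fault tolerance applied to system upgrade, the sketch below runs a new, untrusted component alongside a trusted one and falls back to the trusted one whenever the new output fails a safety check. This is only an assumed simplification in the spirit of the Simplex idea and does not reproduce the design in [Sha 96]. All names (`guarded_step`, `trusted_controller`, `experimental_controller`, `is_safe`) and the numeric limits are hypothetical.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def trusted_controller(state):
    """Conservative, well-tested component: gentle gain, bounded output."""
    return clamp(-0.5 * state, -1.0, 1.0)


def experimental_controller(state):
    """Upgraded component under evaluation: higher gain, no output bound."""
    return -2.0 * state


def is_safe(state, command):
    """Hypothetical safety check: the command must stay within actuator limits."""
    return abs(command) <= 1.0


def guarded_step(trusted, experimental, safe, state):
    """Use the experimental output only if it passes the safety check."""
    try:
        candidate = experimental(state)
        if safe(state, candidate):
            return candidate, "experimental"
    except Exception:
        # A crash in the upgrade is treated like an unsafe output.
        pass
    return trusted(state), "trusted"


# Small disturbance: the upgraded component's command is accepted.
accepted = guarded_step(trusted_controller, experimental_controller, is_safe, 0.2)

# Large disturbance: the upgrade exceeds the limit, so the trusted one is used.
fallback = guarded_step(trusted_controller, experimental_controller, is_safe, 3.0)
```

The design choice in this sketch is that the dependability of the whole rests on the trusted component and the safety check, both of which can be kept small, while the upgraded component may be arbitrarily complex.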

4 Summary


This  report  presents  a  perspective  on  the  directions  of  research  in  the  field  of  fault  tolerance 
for dependable computing systems. The field of fault tolerance encompasses hardware,
software, system, and user issues. In the past, most research has centered on developing new
techniques to achieve dependable systems and on the details of system design and
performance. Increasingly, though, the issues of user errors and the effects of the environment of
use  are  being  recognized  as  critical  to  overall  system  dependability. 

Reliable software will continue to be a key factor in achieving dependable system
performance. Future research will need to address techniques for not only measuring but predicting
the  dependability  of  software-intensive  systems.  Within  this  area,  opportunities  will  exist  for 
synergy  between  fault  tolerance  and  software  engineering  research. 

While future progress will likely be incremental, the community appears poised for
fundamentally new developments. As a context for these developments, and perhaps as precursors to
them,  important  challenges  exist.  Some  of  these  include 

•  software  reliability  prediction  and  measurement 

•  black-box  testing 

•  user  and  usability  issues 

•  repackaging  and  improvements 

•  reliable  error  detection 

•  proving  the  independence  of  software  components 

•  extension  to  other  problem  domains 

•  transition  of  the  technology  to  industry  (real-world  issues) 

•  information  survivability 

These  problems  are  not  just  technical;  there  are  also  social  and  economic  issues  that  have 
broad  effect  on  the  research  and  adoption  of  fault-tolerance  technologies. 

This is a living document that represents the current thinking of the authors. Readers are
invited to help improve this document by sending comments to the authors at

dependable.software@sei.cmu.edu